Web Data Identification and Extraction
Abstract
Nowadays, with the rapid growth of the web, a large volume of data and information is published in numerous web pages. As web sites become more complicated, constructing web information extraction systems becomes more difficult and time-consuming. This paper proposes a new method that performs the task automatically and is more effective than machine learning and semi-automated systems. The proposed method consists of two steps: (1) identifying individual data records in a page, and (2) aligning and extracting data items from the identified data records. For step 1, we propose a method based on visual information to segment data records, which is more accurate than existing methods. For step 2, we propose a novel partial alignment technique based on tree matching. Partial alignment means that we align only those data fields in a pair of data records that can be aligned (or matched) with certainty, and make no commitment on the rest of the data fields.

Keywords: Web mining, Web data extraction, alignment, data records.

1. Introduction

DATA MINING

Data mining is emerging as one of the key features of many homeland security initiatives. Often used as a means for detecting fraud, assessing risk, and product retailing, data mining [1] involves the use of data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. In the context of homeland security, data mining is often viewed as a potential means to identify terrorist activities, such as money transfers and communications, and to identify and track individual terrorists themselves, for example through travel and immigration records.

WEB MINING

Web mining is the application of data mining techniques to discover patterns from the Web. Depending on the analysis target, web mining can be divided into three types: Web usage mining, Web content mining, and Web structure mining.
WEB USAGE MINING

Web usage mining is the process of extracting useful information from server logs, i.e., users' history. It is the process [2] of finding out what users are looking for on the Internet. Some users might be looking only at textual data, whereas others might be interested in multimedia data.

WEB STRUCTURE MINING

Web structure mining is the process of using graph theory to analyze the node and connection structure of a web site. According to the type of web structural data, web structure mining can be divided into two kinds:

1. Extracting patterns from hyperlinks in the web: a hyperlink is a structural component that connects a web page to a different location.
2. Mining the document structure: analysis of the tree-like structure of pages to describe HTML or XML tag usage.

WEB DATA IDENTIFICATION AND EXTRACTION

While data mining products can be very powerful tools, they are not self-sufficient applications. To be successful, data mining requires skilled technical and analytical specialists who [3] can structure the analysis and interpret the output that is created. Consequently, the limitations of data mining are primarily data-related or personnel-related rather than technology-related.

Web Data Identification and Extraction, ISSN 2277-1956/V1N3-1862-1869

Structured data objects are a very important type of information on the Web. Such data objects are often records from underlying databases that are displayed in Web pages using fixed templates. In this paper, we call them data records. Our objective is twofold: (1) automatically identify such data records in a page, and (2) automatically align and extract data items from the data records. The method first segments the page to identify each data record without extracting its data items; it also uses visual cues to find data records.
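As a rough illustration of the two kinds of web structure mining described in this section (hyperlinks and tree-like document structure), the following Python sketch collects hyperlink targets and builds a tag tree by following the nested tags of a small page. It is not the paper's implementation: the class `TagTreeBuilder`, the helper `tag_usage`, and the sample `PAGE` are hypothetical names introduced here, and only the standard-library `html.parser` module is assumed.

```python
from collections import Counter
from html.parser import HTMLParser

# HTML elements that never take a closing tag, so they cannot hold children.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagTreeBuilder(HTMLParser):
    """Builds a nested (tag, children) tree and records hyperlink targets."""

    def __init__(self):
        super().__init__()
        self.root = ("root", [])   # artificial root that holds top-level tags
        self.stack = [self.root]   # path from the root to the open element
        self.links = []            # href values found on <a> tags

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open tag; stray end tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def tag_usage(node, counts=None):
    """Counts how often each tag occurs below `node`."""
    counts = Counter() if counts is None else counts
    for child in node[1]:
        counts[child[0]] += 1
        tag_usage(child, counts)
    return counts


# Hypothetical sample page with two template-generated rows.
PAGE = """
<table>
  <tr><td><a href="/item/1">Item 1</a></td><td>$10</td></tr>
  <tr><td><a href="/item/2">Item 2</a></td><td>$12</td></tr>
</table>
"""

builder = TagTreeBuilder()
builder.feed(PAGE)
builder.close()
tree = builder.root
links = builder.links
usage = tag_usage(tree)
```

In this toy page, the two `tr` sub-trees under `table` have the same shape, which is the kind of repeated structure that a record-identification step could look for. This simple stack-based builder is less tolerant of malformed HTML than a browser rendering engine.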
Visual information helps the system in two ways:

(i) It enables the system to identify gaps that separate data records, which helps to segment data records correctly because the gap within a data record (if any) is typically smaller than the gap between data records.
(ii) The proposed system identifies data records by analyzing HTML tag trees or DOM trees. A straightforward way to build a tag tree is to follow the nested tag structure in the HTML code.

A novel partial tree alignment method is proposed to align and extract corresponding data items from the discovered data records and put the data items in a database table. Using tree alignment is natural because of the nested (tree-structured) organization of HTML code. Specifically, after all data records have been identified, the sub-trees of each data record are rearranged into a single tree, because each data record may be contained in more than one sub-tree of the page's original tag tree and may not be contiguous. The tag trees of all the data records are then aligned using our partial alignment method. The resulting alignment enables us to extract data items from all data records in the page. It can also serve as an extraction pattern for extracting data items from other pages whose data records were generated using the same template.

ADVANTAGES OF DATA MINING

Marketing / Retail

Data mining helps marketing companies build models based on historical data to predict who will respond to new marketing campaigns such as direct mail and online marketing. Through this prediction, marketers can take an appropriate approach to selling profitable products to targeted customers with high satisfaction.

Finance / Banking

Data mining gives financial institutions information about loans and credit reporting. By building a model from previous customers' data with common characteristics, banks and financial institutions can estimate which loans are good or bad and what their risk level is.
In addition, data mining can help banks detect fraudulent credit card transactions and so help card owners prevent losses.

DISADVANTAGES

Privacy Issues

Concerns about personal privacy have grown enormously in recent years, especially as the internet booms with social networks, e-commerce, forums, and blogs. Because of privacy issues, people are afraid that their personal information will be collected and used in unethical ways that could cause them a lot of trouble. Businesses collect information about their customers in many ways to understand their purchasing behavior and trends. However, businesses do not last forever; some day they may be acquired by others or go out of business. At that point, the personal information they hold may be sold to others or leaked.

Misuse of information / inaccurate information

Information collected through data mining for marketing or other ethical purposes can be misused. It can be exploited by unethical people or businesses to take advantage of vulnerable people or to discriminate against a group of people.

DISADVANTAGES OF EXISTING SYSTEM

When multiple pages are given, the extraction target is page-wide information. When a single page is given, the extraction target is usually constrained to record-wide information. Page-level extraction tasks are much more complicated than record-level extraction tasks since more data are involved. The existing approach is time-consuming and exploits only structural information to measure similarity, although visual information is also important for such similarity measures. It also assumes that all peer nodes must be at the same DOM tree level, which is not true for all Web sites.
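To make a structure-only similarity measure concrete, here is a minimal sketch of simple tree matching, a well-known restricted tree matching algorithm. The source does not say which matching algorithm existing systems use, so this choice, the `(tag, children)` tuple representation, and the normalization in `similarity` are assumptions made for illustration.

```python
def simple_tree_matching(a, b):
    """Size of the largest order-preserving, top-down matching of two trees.

    A tree is a (tag, children) tuple. Roots with different tags match nothing.
    """
    if a[0] != b[0]:
        return 0
    kids_a, kids_b = a[1], b[1]
    m, n = len(kids_a), len(kids_b)
    # best[i][j]: best matching between the first i children of a and the
    # first j children of b, keeping sibling order.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i - 1][j],
                best[i][j - 1],
                best[i - 1][j - 1]
                + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]),
            )
    return best[m][n] + 1  # +1 for the matched roots


def size(tree):
    """Number of nodes in a tree."""
    return 1 + sum(size(child) for child in tree[1])


def similarity(a, b):
    """Matched nodes divided by the average tree size (assumed normalization)."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


# Hypothetical records: rec2 has one extra cell; wrapped nests the same cells
# one level deeper inside a <div>.
rec1 = ("tr", [("td", [("a", [])]), ("td", [])])
rec2 = ("tr", [("td", [("a", [])]), ("td", []), ("td", [])])
wrapped = ("tr", [("div", [("td", [("a", [])]), ("td", [])])])

score_similar = simple_tree_matching(rec1, rec2)
score_wrapped = simple_tree_matching(rec1, wrapped)
```

Here `rec1` and `rec2` share four matched nodes, while `rec1` and `wrapped` match only at the root even though they hold the same cells. This shows how a purely structural measure can fail when peer nodes sit at different tree levels.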
ADVANTAGES OF PROPOSED SYSTEM

The proposed system identifies the gaps that separate data records, which helps to segment data records correctly because the gap within a data record (if any) is typically smaller than the gap between data records. It identifies data records by analyzing HTML tag trees; a straightforward way to build a tag tree is to follow the nested tag structure in the HTML code. Tree alignment is natural because of the nested organization of HTML code. Data items are aligned and extracted from the discovered data records and put in a database table. The proposed method leads to more robust tree construction due to the high error tolerance of the rendering engines of Web browsers.
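The gap-based cue mentioned first above can be sketched as follows. This is only an illustration under stated assumptions: each rendered block is given as a hypothetical `(label, top, bottom)` triple in pixels, and a gap is treated as a record boundary when it exceeds `ratio` times the smallest positive gap on the page. The source does not specify how its threshold is chosen, so `segment_by_gaps` and `ratio` are invented for this example.

```python
from typing import List, Tuple

Block = Tuple[str, float, float]  # (label, top, bottom) in rendered pixels


def segment_by_gaps(blocks: List[Block], ratio: float = 2.0) -> List[List[str]]:
    """Groups vertically ordered blocks into records, splitting at large gaps."""
    if not blocks:
        return []
    blocks = sorted(blocks, key=lambda b: b[1])
    # Vertical whitespace between each block and the next one.
    gaps = [max(0.0, nxt[1] - cur[2]) for cur, nxt in zip(blocks, blocks[1:])]
    positive = [g for g in gaps if g > 0]
    # Assumed heuristic: within-record gaps are close to the smallest gap.
    threshold = ratio * min(positive) if positive else float("inf")
    records = [[blocks[0][0]]]
    for block, gap in zip(blocks[1:], gaps):
        if gap > threshold:
            records.append([])  # a large gap starts a new record
        records[-1].append(block[0])
    return records


# Hypothetical layout: 2-pixel gaps inside a record, 20-pixel gaps between.
layout = [
    ("title1", 0, 20), ("price1", 22, 40),
    ("title2", 60, 80), ("price2", 82, 100),
    ("title3", 120, 140), ("price3", 142, 160),
]
records = segment_by_gaps(layout)
```

With this layout the sketch yields three records of two blocks each. If all gaps on a page are equal, nothing exceeds the threshold and the blocks stay in one group, so a real system would need a more careful rule.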
Similar resources
Data Extraction using Content-Based Handles
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. In our extraction algorithm, we are inspired by the way a human user scans the page content for specific data. In particular, we use text fea...
Extraction of Flat and Nested Data Records from Web Pages
This paper studies the problem of identification and extraction of flat and nested data records from a given web page. With the explosive growth of information sources available on the World Wide Web, it has become increasingly difficult to identify the relevant pieces of information, since web pages are often cluttered with irrelevant content like advertisements, navigation-panels, copyright n...
Functionality-Based Web Image Categorization
The World Wide Web provides an increasingly powerful and popular publication mechanism. Web documents often contain a large number of images serving various different purposes. Identifying the functional categories of these images has important applications including information extraction, web mining, web page summarization and mobile access. This paper describes a study on the functional cate...
Extraction of Data from Web Pages: A Vision Based Approach
With the explosive growth of information sources available on the World Wide Web, it has become increasingly difficult to identify the relevant pieces of information, since web pages are often cluttered with irrelevant content like advertisements, navigation-panels, copyright notices etc., surrounding the main content of the web page. Hence, tools for the mining of data regions, data records an...
Efficient Statement Identification for Automatic Market Forecasting
Strategic business decision making involves the analysis of market forecasts. Today, the identification and aggregation of relevant market statements is done by human experts, often by analyzing documents from the World Wide Web. We present an efficient information extraction chain to automate this complex natural language processing task and show results for the identification part. Based on t...
Person Name Identification in Chinese Documents Using Finite State Automata
This research is about automatic identification and extraction of person names in Chinese text documents. Solutions to this problem have immediate and extensive applications in many areas especially in Web Intelligent Agents related applications such as Web search engines, Web data mining, and automatic Web information analysis. We have noted that while finite state automata (FSA) based techniq...